Post-training quantization (PTQ), which only requires a tiny dataset for calibration without end-to-end retraining, is a light and practical model compression technique. Recently, several PTQ schemes for vision transformers (ViTs) have been presented; unfortunately, they typically suffer from non-trivial accuracy degradation, especially in low-bit cases. In this paper, we propose RepQ-ViT, a novel PTQ framework for ViTs based on quantization scale reparameterization, to address the above issues. RepQ-ViT decouples the quantization and inference processes, where the former employs complex quantizers and the latter employs scale-reparameterized simplified quantizers. This ensures both accurate quantization and efficient inference, which distinguishes it from existing approaches that sacrifice quantization performance to meet the target hardware. More specifically, we focus on two components with extreme distributions: post-LayerNorm activations with severe inter-channel variation and post-Softmax activations with power-law features, and initially apply channel-wise quantization and log$\sqrt{2}$ quantization, respectively. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference, with only slight accuracy or computational costs. Extensive experiments are conducted on multiple vision tasks with different model variants, proving that RepQ-ViT, without hyperparameters and expensive reconstruction procedures, can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
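As a rough illustration of the log$\sqrt{2}$ and log2 quantizers mentioned above, the following NumPy sketch quantizes values in (0, 1] to integer exponents and shows that a base-$\sqrt{2}$ code can be decoded with a base-2 shift plus a parity-dependent scale factor. This is a minimal sketch under assumptions, not the RepQ-ViT implementation: the function names, the 4-bit setting, and the unit scale are all hypothetical choices made for the example.

```python
import numpy as np

def log_quantize(x, bits=4, base=2.0, scale=1.0):
    """Map positive values to integer exponents q so that x ~= scale * base**(-q)."""
    levels = 2 ** bits - 1
    # Guard against log(0); very small values clip to the largest exponent anyway.
    safe = np.maximum(np.asarray(x, dtype=np.float64) / scale, 1e-30)
    q = np.round(-np.log(safe) / np.log(base))
    return np.clip(q, 0, levels).astype(np.int64)

def log_dequantize(q, base=2.0, scale=1.0):
    """Inverse mapping: scale * base**(-q)."""
    return scale * np.power(base, -q.astype(np.float64))

def dequantize_sqrt2_via_log2(q, scale=1.0):
    """Decode base-sqrt(2) exponents using only a power-of-two shift.

    Uses the identity sqrt(2)**(-q) = 2**(-ceil(q/2)) * (sqrt(2) if q is odd else 1),
    so the odd/even parity is folded into the scale (hypothetical helper).
    """
    shift = (q + 1) // 2  # ceil(q / 2) for non-negative integers
    parity_scale = np.where(q % 2 == 1, np.sqrt(2.0), 1.0)
    return scale * parity_scale * np.power(2.0, -shift.astype(np.float64))

# Exact powers of two round-trip through the log2 quantizer.
q2 = log_quantize(np.array([1.0, 0.5, 0.25]), bits=4, base=2.0)

# Every 4-bit base-sqrt(2) code decodes identically through the base-2 path.
codes = np.arange(16)
direct = log_dequantize(codes, base=np.sqrt(2.0))
via_shift = dequantize_sqrt2_via_log2(codes)
```

The base-$\sqrt{2}$ grid places twice as many levels per octave as the base-2 grid, which is why it can describe large attention scores more finely; the shift-based decoder indicates how such a code could still be executed with power-of-two arithmetic.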
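Data-free quantization can potentially address data privacy and security concerns in model compression, and has therefore been widely studied. Recently, PSAQ-ViT designed a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs), achieving the first attempt at data-free quantization for ViTs. In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs, built on top of PSAQ-ViT. More specifically, following the patch similarity metric in PSAQ-ViT, we introduce an adaptive teacher-student strategy, which facilitates the constant cyclic evolution of the generated samples and the quantized model (student) in a competitive and interactive fashion under the supervision of the full-precision model (teacher), thus significantly improving the accuracy of the quantized model. Moreover, without auxiliary category guidance, we employ task- and model-independent prior information, making the general-purpose scheme compatible with a broad range of vision tasks and models. Extensive experiments are conducted on various models for image classification, object detection, and semantic segmentation tasks, and PSAQ-ViT V2, with a naive quantization strategy and without access to real-world data, consistently achieves competitive results, showing its potential as a strong baseline for data-free quantization of ViTs. For instance, with Swin-S as the (backbone) model, 8-bit quantization reaches 82.13 top-1 accuracy on ImageNet, 50.9 box AP and 44.1 mask AP on COCO, and 47.2 mIoU on ADE20K. We hope that the accurate and general PSAQ-ViT V2 can serve as a potential and practical solution in real-world applications involving sensitive data. Code will be released and merged at: https://github.com/zkkli/psaq-vit.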
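Vision transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity. Unfortunately, existing efforts to quantize ViTs rely on simulated quantization (also known as fake quantization), which still uses floating-point arithmetic during inference and thus contributes little to model acceleration. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, which enables ViTs to execute the entire inference computational graph with integer operations and bit-shifting, and without any floating-point operations. In I-ViT, linear operations (e.g., MatMul and Dense) follow an integer-only pipeline with dyadic arithmetic, while non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed lightweight integer-only arithmetic methods. In particular, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models, and the results show that integer-only INT8 quantization achieves accuracy comparable to (or even higher than) the full-precision (FP) baseline. Furthermore, we use TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72~4.11$\times$ inference speedup compared with the FP model.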
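To make the idea of dyadic arithmetic in an integer-only pipeline more concrete, the sketch below approximates a real-valued rescaling factor as b / 2^c with integers b and c, so that rescaling an integer accumulator needs only a multiply and a bit-shift. This is an illustrative sketch under assumptions rather than the I-ViT code: the helper names `to_dyadic` and `requantize` and the 16-bit mantissa width are hypothetical.

```python
import math
from typing import Tuple

def to_dyadic(multiplier: float, bits: int = 16) -> Tuple[int, int]:
    """Approximate a positive real multiplier as b / 2**c with integer b and c."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    # frexp gives multiplier = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(multiplier)
    b = round(mantissa * (1 << bits))
    c = bits - exponent
    if c < 1:
        raise ValueError("multiplier too large for the chosen bit width")
    return b, c

def requantize(acc: int, b: int, c: int) -> int:
    """Integer-only rescale: approximately round(acc * b / 2**c)."""
    # Adding half of the divisor before the shift rounds to nearest.
    return (acc * b + (1 << (c - 1))) >> c

# Example: rescale an integer accumulator by a real factor without floats at run time.
b, c = to_dyadic(0.0123)
rescaled = requantize(12345, b, c)
```

In this formulation the floating-point factor is only needed offline, when b and c are computed; at inference time the rescale consists of integer operations only.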
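Vision transformers have recently achieved great success on various computer vision tasks. However, their high model complexity makes them challenging to deploy on resource-constrained devices. Quantization is an effective approach to reducing model complexity, and data-free quantization, which can address data privacy and security concerns during model deployment, has received widespread interest. Unfortunately, all existing methods, such as BN regularization, were designed for convolutional neural networks and cannot be applied to vision transformers, whose model architectures are significantly different. In this paper, we propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers, which generates "realistic" samples based on the unique properties of vision transformers in order to calibrate the quantization parameters. Specifically, we analyze the properties of the self-attention module and reveal a general difference (patch similarity) in how it processes Gaussian noise and real images. These insights guide us to design a relative value metric that optimizes the Gaussian noise to approximate real images, which are then used to calibrate the quantization parameters. Extensive experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT, which can even outperform real-data-driven methods.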
Blind image quality assessment (BIQA) remains challenging because of the diversity of distortions and the variation in image content, which complicate distortion patterns across different scales and make the BIQA regression problem harder. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has examined learning strategies that help the regression model perform better. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and to better optimize the regression problem in line with the easy-to-hard progression of human learning. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets. The results indicate that PMT-IQA outperforms the compared approaches and that both the MS and PMT modules improve the model's performance.
Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel \emph{image disguising} mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets are novel combinations of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique \cite{huang20}. We have analyzed and evaluated them under a multi-level threat model. RMT provides a better security guarantee than InstaHide, under the Level-1 adversarial knowledge with well-preserved model quality. In contrast, AES provides a security guarantee under the Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help us to protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.
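The block-level pipeline described above (splitting an image into blocks, randomly permuting the blocks, then transforming each block) can be sketched in NumPy as follows. This is a toy sketch under stated assumptions, not the DisguisedNets implementation: the function names are hypothetical, a single shared Gaussian matrix stands in for the random projection step, and the paper's actual RMT and AES transformations may be defined differently.

```python
import numpy as np

def split_blocks(img, block):
    """Cut an (H, W) image into non-overlapping (block, block) tiles in row-major order."""
    h, w = img.shape
    if h % block or w % block:
        raise ValueError("image size must be a multiple of the block size")
    return (img.reshape(h // block, block, w // block, block)
               .swapaxes(1, 2)
               .reshape(-1, block, block))

def merge_blocks(blocks, shape):
    """Inverse of split_blocks: reassemble tiles into an (H, W) image."""
    h, w = shape
    block = blocks.shape[1]
    return (blocks.reshape(h // block, w // block, block, block)
                  .swapaxes(1, 2)
                  .reshape(h, w))

def disguise(img, block, rng):
    """Permute blocks and apply a secret random projection to each block (toy stand-in)."""
    blocks = split_blocks(img.astype(np.float64), block)
    perm = rng.permutation(len(blocks))      # secret block order
    proj = rng.normal(size=(block, block))   # secret projection matrix (assumed shared by all blocks)
    return merge_blocks(blocks[perm] @ proj, img.shape), (perm, proj)

def recover(disguised, block, key):
    """Undo the disguise given the secret key; shown only to check the sketch."""
    perm, proj = key
    blocks = split_blocks(disguised, block) @ np.linalg.inv(proj)
    original = np.empty_like(blocks)
    original[perm] = blocks                  # invert the permutation
    return merge_blocks(original, disguised.shape)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(np.float64)
hidden, key = disguise(image, block=4, rng=rng)
restored = recover(hidden, block=4, key=key)
```

Only the data owner holds the permutation and projection matrix, so the party training on `hidden` never sees the original pixel layout; how much protection this offers depends on the threat model, as the abstract notes.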
A storyboard is a roadmap for video creation that consists of shot-by-shot images visualizing the key plots in a text synopsis. Creating video storyboards, however, remains challenging: it requires not only associating high-level texts with images, but also long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes that are manually selected from the corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence across long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models in creating text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work.
Solving real-world optimal control problems is a challenging task, as the system dynamics can be highly non-linear or include nonconvex objectives and constraints, while in some cases the dynamics are unknown, making it hard to solve numerically for the optimal control actions. To deal with such modeling and computation challenges, in this paper we integrate neural networks with Pontryagin's Minimum Principle (PMP) and propose a computationally efficient framework, NN-PMP. The resulting controller can be implemented for systems with unknown and complex dynamics. It can not only utilize accurate surrogate models parameterized by neural networks, but also efficiently recover the optimality conditions along with the optimal action sequences via the PMP conditions. A toy example on a nonlinear Martian Base operation, along with a real-world lossy energy storage arbitrage example, demonstrates that the proposed NN-PMP is a general and versatile computation tool for finding optimal solutions. Compared with solutions provided by a numerical optimization solver with approximated linear dynamics, NN-PMP achieves more efficient system modeling and higher performance in terms of control objectives.
The task of reconstructing 3D human motion has wide-ranging applications. The gold-standard motion capture (MoCap) systems are accurate but inaccessible to the general public due to their cost, hardware, and space constraints. In contrast, monocular human mesh recovery (HMR) methods are much more accessible than MoCap, as they take single-view videos as inputs. Replacing the multi-view MoCap systems with a monocular HMR method would break the current barriers to collecting accurate 3D motion, thus making exciting applications like motion analysis and motion-driven animation accessible to the general public. However, the performance of existing HMR methods degrades when the video contains challenging and dynamic motion that is not in the existing MoCap datasets used for training. This reduces their appeal, as dynamic motion is frequently the target of 3D motion recovery in the aforementioned applications. Our study aims to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field, which is optimized to represent the underlying 3D motions across a set of videos of the same action. Empirically, we show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, where NeMo outperforms existing HMR methods in terms of 2D keypoint detection. To further validate NeMo using 3D metrics, we collected a small MoCap dataset mimicking actions in Penn Action, and show that NeMo achieves better 3D reconstruction than various baselines.
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images, which are then transformed into text. An important question is: which annotation best reflects a deep understanding of image content? Similarly, given a text, what is the best image to present its semantics? In this work, we argue that the best text or caption for a given image is the text that would generate the image most similar to that image. Likewise, the best image for a given text is the image that yields the caption best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.